Distinct data types of features in crash_data:
[dtype('O') dtype('float64')]
First few rows of the crash data:
Crash Number City Town Name Crash Date \
0 5342297 LOWELL 01/01/2024
1 5342292 LOWELL 01/01/2024
2 5342292 LOWELL 01/01/2024
3 5342292 LOWELL 01/01/2024
4 5342292 LOWELL 01/01/2024
Crash Severity Crash Status Crash Time Crash Year \
0 Non-fatal injury Open 3:26 AM 2024.0
1 Property damage only (none injured) Open 12:48 AM 2024.0
2 Property damage only (none injured) Open 12:48 AM 2024.0
3 Property damage only (none injured) Open 12:48 AM 2024.0
4 Property damage only (none injured) Open 12:48 AM 2024.0
Max Injury Severity Reported Number of Vehicles Police Agency Type ... \
0 Possible Injury (C) 1.0 Local police ...
1 No Apparent Injury (O) 2.0 Local police ...
2 No Apparent Injury (O) 2.0 Local police ...
3 No Apparent Injury (O) 2.0 Local police ...
4 No Apparent Injury (O) 2.0 Local police ...
X Y Latitude Longitude Vehicle Unit Number Vehicle Make Vehicle Model \
0 NaN NaN NaN NaN 1.0 HOND HR-V
1 NaN NaN NaN NaN 1.0 NISS ALTIMA
2 NaN NaN NaN NaN 2.0 HOND ACCORD
3 NaN NaN NaN NaN 2.0 HOND ACCORD
4 NaN NaN NaN NaN 2.0 HOND ACCORD
Person Number Age Sex
0 1.0 32.0 F - Female
1 1.0 60.0 M - Male
2 2.0 NaN NaN
3 3.0 31.0 M - Male
4 4.0 NaN M - Male
[5 rows x 72 columns]
ANALYTICAL AVENGERS
INFO 523 - Project Final
Abstract
This study investigates the relationship between age demographics and severe crashes, with the aim of developing a predictive model to improve road safety in Massachusetts. Using a crash dataset from January 2024, we examine how age correlates with crash severity and how environmental factors (lighting, weather, road surface conditions, speed limits, and the number of vehicles involved) contribute. Our analysis identifies which age groups, among both drivers and vulnerable users, are at greater risk of severe crashes, and which environmental conditions raise the likelihood and severity of crashes, providing a basis for targeted safety measures. To classify crash severity, we compared four machine learning (ML) classifiers: logistic regression, decision trees, random forests, and k-nearest neighbors (KNN). All four reached accuracies between 99.98% and 100%, indicating a strong ability to classify crash severity from the selected features. A limitation is the absence of road volume or vehicle-miles-traveled data, which prevents us from putting crash frequency in the context of exposure. The results give policymakers and practitioners a tool for more proactive safety measures and resource allocation: by predicting crash risk from age demographics and environmental conditions, authorities can target preemptive interventions to reduce severe crashes. Ultimately, this study contributes to a data-driven approach to road safety, with the potential to improve public safety and traffic management.
Introduction
Understanding the factors contributing to severe car crashes is crucial for improving road safety and reducing traffic-related injuries and fatalities. This project aims to develop a predictive model that correlates age demographics with severe crashes in Massachusetts. The ultimate goal is to identify key risk factors and provide data-driven insights for implementing effective safety measures.
Our team analyzed a dataset of car crashes from January 2024, collected from the Massachusetts Registry of Motor Vehicles. The dataset has 72 columns covering crash characteristics, driver demographics, environmental conditions, and vehicle information. By examining these variables, we seek to uncover patterns that link age with severe crashes and to identify high-risk groups and circumstances.
Our analysis focuses on two research questions: which age groups are most at risk of severe crashes, and what role environmental factors such as lighting, weather, road conditions, and speed limits play. We also aim to build a predictive model that classifies crash severity from these variables. To do so, we compared several binary classifiers: logistic regression, a decision tree, a random forest, and k-nearest neighbors (KNN).
The analysis involved several steps. First, we pre-processed the dataset to handle missing data, standardize categorical variables, and scale numerical features. Next, we conducted exploratory data analysis to identify significant correlations and patterns. To predict crash severity, we trained each classifier on a training subset (19,690 rows) and evaluated it on a separate test set (4,923 rows), measuring accuracy, precision, recall, and F1-score. The high scores suggest potential for real-world application in road safety.
This report details our approach to analyzing the Massachusetts crash dataset, including the steps taken to process the data, build the predictive model, and evaluate its performance. We discuss our findings and provide insights into which age groups are most at risk, along with the environmental factors that contribute to severe crashes. Through this work, we aim to contribute to road safety practices and provide useful information for policymakers, traffic safety professionals, and other stakeholders interested in reducing traffic-related incidents and enhancing public safety.
Questions
- Which age groups are at the highest risk of severe crashes, and how do lighting, weather, road conditions, speed limits, and the number of vehicles involved affect the risk for each age group?
- Can a model accurately classify crash severity using the contributing factors identified in the first question?
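The age groups in the first question can be formed by binning the numeric `Age` column. The sketch below is a minimal illustration, not the notebook's actual code: the labels `25-35`, `35-50`, and `60 and above` appear in the output later in this report, while `Under 25`, `50-60`, and the handling of ages that fall exactly on a boundary are assumptions.

```python
from bisect import bisect_right

# Upper-exclusive bin edges. Only "25-35", "35-50" and "60 and above" appear
# in the report's output; the other two labels and the edge handling are assumed.
EDGES = [25, 35, 50, 60]
LABELS = ["Under 25", "25-35", "35-50", "50-60", "60 and above"]


def age_group(age: float) -> str:
    """Map an age in years to one of five age-group labels."""
    return LABELS[bisect_right(EDGES, age)]


if __name__ == "__main__":
    for age in (32, 60, 36, 31):
        print(age, "->", age_group(age))
```

With these edges, the ages 32, 60, 36, and 31 from the first rows of the data fall into the same groups the report shows for them.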
Analysis Plan
Question 1
Crash Year Number of Vehicles MassDOT District Total Fatalities \
count 25547.0 25547.000000 25547.000000 25547.000000
mean 2024.0 1.976749 4.019063 0.003562
std 0.0 0.702530 1.325421 0.068730
min 2024.0 1.000000 1.000000 0.000000
25% 2024.0 2.000000 3.000000 0.000000
50% 2024.0 2.000000 4.000000 0.000000
75% 2024.0 2.000000 5.000000 0.000000
max 2024.0 9.000000 6.000000 3.000000
Total Non-Fatal Injuries Speed Limit X Y \
count 25547.000000 23389.000000 21002.000000 21002.000000
mean 0.318824 34.394502 205930.128516 887470.383156
std 0.728140 12.979679 49539.383540 31782.135543
min 0.000000 1.000000 44708.708525 779050.104521
25% 0.000000 25.000000 179154.370652 870946.937400
50% 0.000000 30.000000 224092.943601 889548.926635
75% 0.000000 40.000000 237299.607076 908937.437400
max 8.000000 65.000000 327948.082270 958417.191000
Latitude Longitude Vehicle Unit Number Person Number \
count 20823.000000 20823.000000 25220.000000 25547.000000
mean 42.234940 -71.431249 1.489968 1.918699
std 0.287058 0.600959 0.637851 1.568750
min 41.251611 -73.386241 1.000000 1.000000
25% 42.086592 -71.756001 1.000000 1.000000
50% 42.254041 -71.209095 1.000000 2.000000
75% 42.428108 -71.049485 2.000000 2.000000
max 42.874973 -69.962834 9.000000 42.000000
Age
count 23002.000000
mean 38.952265
std 18.503512
min 0.000000
25% 24.000000
50% 36.000000
75% 53.000000
max 99.000000
Age 2548
Light Conditions 3
Weather Conditions 3
Road Surface Condition 3
dtype: int64
Age 0
Light Conditions 0
Weather Conditions 0
Road Surface Condition 0
dtype: int64
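The two summaries above show the 2,548 missing `Age` values dropping to zero. One way to get that result is median imputation, which is consistent with the later previews, where ages that were missing appear as 36.0, the median in the summary statistics. The sketch below assumes that approach; the helper name `impute_median` is hypothetical.

```python
import math
from statistics import median


def impute_median(values):
    """Replace None/NaN entries with the median of the observed values."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in values if not is_missing(v)]
    fill = median(observed)
    return [fill if is_missing(v) else v for v in values]


if __name__ == "__main__":
    ages = [32.0, 60.0, float("nan"), 31.0, None, 36.0]
    print(impute_median(ages))
```

Categorical columns such as `Light Conditions` need a different rule, for example filling with the most frequent value or dropping the few affected rows.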
[Figure: Age group and crash severity]
[Figure: Crash severity and light conditions]
[Figure: Crash severity and weather conditions]
[Figure: Crash severity and road surface conditions]
[Figure: Number of crashes by age group and light conditions]
Question 2
Analysis of Missing Values for numerical features:
Missing Values Percentage (%)
Crash Year 3 0.012189
Number of Vehicles 3 0.012189
MassDOT District 3 0.012189
Total Fatalities 3 0.012189
Total Non-Fatal Injuries 3 0.012189
Speed Limit 1984 8.060781
X 4442 18.047373
Y 4442 18.047373
Latitude 4612 18.738065
Longitude 4612 18.738065
Vehicle Unit Number 326 1.324503
Person Number 3 0.012189
Age 0 0.000000
feature_variable 0 0.000000
Analysis of Missing Values for categorical features:
Missing Values \
Crash Number 0
City Town Name 0
Crash Date 3
Crash Status 3
Crash Time 3
Max Injury Severity Reported 3
Police Agency Type 3
State Police Troop 19852
Age of Driver - Youngest Known 489
Age of Driver - Oldest Known 487
Age of Vulnerable User - Youngest Known 23848
Age of Vulnerable User - Oldest Known 23848
Crash Hour 3
Driver Contributing Circumstances (All Drivers) 668
Driver Distracted By (All Vehicles) 4628
First Harmful Event 3
Is Geocoded 3
Light Conditions 0
Manner of Collision 3
Vulnerable User Action (All Persons) 23921
Vulnerable User Location (All Persons) 23921
Vulnerable User Type (All Persons) 23864
RMV Document Numbers 70
Road Surface Condition 0
Roadway Junction Type 3
RPA Abbreviation 3
Traffic Control Device Type 3
Trafficway Description 3
Vehicle Actions Prior to Crash (All Vehicles) 3
Vehicle Configuration (All Vehicles) 38
Vehicle Emergency Use (All Vehicles) 661
Vehicle Towed From Scene (All Vehicles) 78
Vehicle Travel Directions (All Vehicles) 3
Weather Conditions 0
County Name 3
Crash Report IDs 3
FMCSA Reportable (All Vehicles) 3
FMCSA Reportable (Crash) 3
First Harmful Event Location 3
Geocoding Method 4442
Hit and Run 3
Locality 24597
Most Harmful Event (All Vehicles) 301
Road Contributing Circumstance 16861
School Bus Related 3
Traffic Control Device Function 3
Vehicle Sequence of Events (All Vehicles) 235
Work Zone Related 3
Vulnerable User Sequence of Events (All Persons) 24594
Vulnerable User Distracted By (All Persons) 24598
Vulnerable User Traffic Control Type (All persons) 24596
Vulnerable User Origin Destination (All Persons) 24596
Vulnerable User Contributing Circumstances (All... 24598
Vulnerable User Alcohol Suspected Type (All Per... 24599
Vulnerable User Drug Suspected Type (All Persons) 24599
Vehicle Make 818
Vehicle Model 5149
Sex 1556
Age Group 0
Weather Group 22961
Percentage (%)
Crash Number 0.000000
City Town Name 0.000000
Crash Date 0.012189
Crash Status 0.012189
Crash Time 0.012189
Max Injury Severity Reported 0.012189
Police Agency Type 0.012189
State Police Troop 80.656564
Age of Driver - Youngest Known 1.986755
Age of Driver - Oldest Known 1.978629
Age of Vulnerable User - Youngest Known 96.891886
Age of Vulnerable User - Oldest Known 96.891886
Crash Hour 0.012189
Driver Contributing Circumstances (All Drivers) 2.714013
Driver Distracted By (All Vehicles) 18.803072
First Harmful Event 0.012189
Is Geocoded 0.012189
Light Conditions 0.000000
Manner of Collision 0.012189
Vulnerable User Action (All Persons) 97.188478
Vulnerable User Location (All Persons) 97.188478
Vulnerable User Type (All Persons) 96.956893
RMV Document Numbers 0.284403
Road Surface Condition 0.000000
Roadway Junction Type 0.012189
RPA Abbreviation 0.012189
Traffic Control Device Type 0.012189
Trafficway Description 0.012189
Vehicle Actions Prior to Crash (All Vehicles) 0.012189
Vehicle Configuration (All Vehicles) 0.154390
Vehicle Emergency Use (All Vehicles) 2.685573
Vehicle Towed From Scene (All Vehicles) 0.316906
Vehicle Travel Directions (All Vehicles) 0.012189
Weather Conditions 0.000000
County Name 0.012189
Crash Report IDs 0.012189
FMCSA Reportable (All Vehicles) 0.012189
FMCSA Reportable (Crash) 0.012189
First Harmful Event Location 0.012189
Geocoding Method 18.047373
Hit and Run 0.012189
Locality 99.934994
Most Harmful Event (All Vehicles) 1.222931
Road Contributing Circumstance 68.504449
School Bus Related 0.012189
Traffic Control Device Function 0.012189
Vehicle Sequence of Events (All Vehicles) 0.954780
Work Zone Related 0.012189
Vulnerable User Sequence of Events (All Persons) 99.922805
Vulnerable User Distracted By (All Persons) 99.939057
Vulnerable User Traffic Control Type (All persons) 99.930931
Vulnerable User Origin Destination (All Persons) 99.930931
Vulnerable User Contributing Circumstances (All... 99.939057
Vulnerable User Alcohol Suspected Type (All Per... 99.943119
Vulnerable User Drug Suspected Type (All Persons) 99.943119
Vehicle Make 3.323447
Vehicle Model 20.919839
Sex 6.321862
Age Group 0.000000
Weather Group 93.288100
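Each percentage in the tables above is the missing count divided by the number of rows, times 100. The reported values are consistent with 24,613 rows (for example, 1984 / 24613 is about 8.06%). A minimal sketch of the calculation, using a hypothetical helper `missing_summary` and toy data:

```python
def missing_summary(rows, columns):
    """Return {column: (missing_count, missing_percentage)} for a list of dict rows."""
    n = len(rows)
    summary = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        summary[col] = (missing, 100.0 * missing / n if n else 0.0)
    return summary


if __name__ == "__main__":
    toy = [
        {"Speed Limit": 30, "Sex": "F - Female"},
        {"Speed Limit": None, "Sex": "M - Male"},
        {"Speed Limit": 25, "Sex": None},
        {"Speed Limit": None, "Sex": "M - Male"},
    ]
    print(missing_summary(toy, ["Speed Limit", "Sex"]))
    # Sanity check against the Speed Limit row of the report's table.
    print(round(100 * 1984 / 24613, 6))
```

With pandas, the same figures come from `df.isnull().sum()` and `df.isnull().mean() * 100`.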
Crash Number City Town Name Crash Date Crash Status Crash Time Crash Year \
0 5342297 LOWELL 01/01/2024 Open 3:26 AM 2024.0
1 5342292 LOWELL 01/01/2024 Open 12:48 AM 2024.0
2 5342292 LOWELL 01/01/2024 Open 12:48 AM 2024.0
3 5342292 LOWELL 01/01/2024 Open 12:48 AM 2024.0
4 5342292 LOWELL 01/01/2024 Open 12:48 AM 2024.0
Max Injury Severity Reported Number of Vehicles Police Agency Type \
0 Possible Injury (C) 1.0 Local police
1 No Apparent Injury (O) 2.0 Local police
2 No Apparent Injury (O) 2.0 Local police
3 No Apparent Injury (O) 2.0 Local police
4 No Apparent Injury (O) 2.0 Local police
Age of Driver - Youngest Known ... Latitude Longitude \
0 25-34 ... 42.339231 -71.207633
1 55-64 ... 42.339231 -71.207633
2 55-64 ... 42.339231 -71.207633
3 55-64 ... 42.339231 -71.207633
4 55-64 ... 42.339231 -71.207633
Vehicle Unit Number Vehicle Make Vehicle Model Person Number Age \
0 1.0 HOND HR-V 1.0 32.0
1 1.0 NISS ALTIMA 1.0 60.0
2 2.0 HOND ACCORD 2.0 36.0
3 2.0 HOND ACCORD 3.0 31.0
4 2.0 HOND ACCORD 4.0 36.0
Sex Age Group feature_variable
0 F - Female 25-35 1
1 M - Male 60 and above 0
2 M - Male 35-50 0
3 M - Male 25-35 0
4 M - Male 35-50 0
[5 rows x 58 columns]
Crash Number City Town Name Crash Date Crash Status Crash Time Crash Year \
0 5342297 LOWELL 01/01/2024 Open 3:26 AM 0.0
1 5342292 LOWELL 01/01/2024 Open 12:48 AM 0.0
2 5342292 LOWELL 01/01/2024 Open 12:48 AM 0.0
3 5342292 LOWELL 01/01/2024 Open 12:48 AM 0.0
4 5342292 LOWELL 01/01/2024 Open 12:48 AM 0.0
Max Injury Severity Reported Number of Vehicles Police Agency Type \
0 Possible Injury (C) -1.388213 Local police
1 No Apparent Injury (O) 0.027963 Local police
2 No Apparent Injury (O) 0.027963 Local police
3 No Apparent Injury (O) 0.027963 Local police
4 No Apparent Injury (O) 0.027963 Local police
Age of Driver - Youngest Known ... Latitude Longitude Vehicle Unit Number \
0 25-34 ... 0.325956 0.330683 -0.760059
1 55-64 ... 0.325956 0.330683 -0.760059
2 55-64 ... 0.325956 0.330683 0.806457
3 55-64 ... 0.325956 0.330683 0.806457
4 55-64 ... 0.325956 0.330683 0.806457
Vehicle Make Vehicle Model Person Number Age Sex \
0 HOND HR-V -0.587190 -0.376621 F - Female
1 NISS ALTIMA -0.587190 1.193908 M - Male
2 HOND ACCORD 0.041754 -0.152259 M - Male
3 HOND ACCORD 0.670698 -0.432711 M - Male
4 HOND ACCORD 1.299642 -0.152259 M - Male
Age Group feature_variable
0 25-35 1
1 60 and above 0
2 35-50 0
3 25-35 0
4 35-50 0
[5 rows x 58 columns]
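The preview directly above shows the numeric columns after standardization: each value has the column mean subtracted and is divided by the column standard deviation, which is why the constant `Crash Year` column becomes 0.0. The sketch below assumes the population standard deviation and maps a zero-variance column to zeros, which is how scikit-learn's `StandardScaler` behaves; the report does not state which scaler was used.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores using the population standard deviation.

    A constant column has zero spread; divide by 1 instead so it maps to 0.0.
    """
    mu = fmean(values)
    sigma = pstdev(values) or 1.0
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    print(standardize([2024.0, 2024.0, 2024.0]))  # constant column
    print(standardize([1.0, 2.0, 2.0, 3.0]))
```

After this transformation every non-constant column has mean 0 and unit variance, which keeps distance-based models such as KNN from being dominated by large-valued columns like `X` and `Y`.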
Crash Year Number of Vehicles MassDOT District Total Fatalities \
0 0.0 -1.388213 -0.016167 -0.052805
1 0.0 0.027963 -0.016167 -0.052805
2 0.0 0.027963 -0.016167 -0.052805
3 0.0 0.027963 -0.016167 -0.052805
4 0.0 0.027963 -0.016167 -0.052805
Total Non-Fatal Injuries Speed Limit X Y Latitude \
0 0.905249 0.056513 0.3275 0.321189 0.325956
1 -0.447732 -0.343256 0.3275 0.321189 0.325956
2 -0.447732 -0.343256 0.3275 0.321189 0.325956
3 -0.447732 -0.343256 0.3275 0.321189 0.325956
4 -0.447732 -0.343256 0.3275 0.321189 0.325956
Longitude ... Vehicle Model_1228 Sex_0 Sex_1 Sex_2 Sex_3 \
0 0.330683 ... False True False False False
1 0.330683 ... False False True False False
2 0.330683 ... False False True False False
3 0.330683 ... False False True False False
4 0.330683 ... False False True False False
Age Group_0 Age Group_1 Age Group_2 Age Group_3 Age Group_4
0 False True False False False
1 False False False True False
2 False False True False False
3 False True False False False
4 False False True False False
[5 rows x 36680 columns]
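The jump from 58 to 36,680 columns above comes from one-hot encoding: every distinct value of a categorical column becomes its own boolean column, so fields with many distinct values, such as `Crash Number`, `Crash Time`, and `Crash Report IDs`, can each add hundreds or thousands of columns. The column names in the output (`Sex_0`, `Age Group_1`) suggest the categories were mapped to integer codes first; the sketch below assumes that and uses a hypothetical helper name.

```python
def one_hot(rows, column):
    """One-hot encode `column` in a list of dict rows.

    Categories are mapped to integer codes in sorted order, and each code
    becomes a boolean column named "<column>_<code>".
    """
    categories = sorted({row[column] for row in rows})
    code = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for row in rows:
        out = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            out[f"{column}_{code[cat]}"] = row[column] == cat
        encoded.append(out)
    return encoded


if __name__ == "__main__":
    rows = [
        {"Crash Number": "5342297", "Sex": "F - Female"},
        {"Crash Number": "5342292", "Sex": "M - Male"},
        {"Crash Number": "5342292", "Sex": "M - Male"},
    ]
    for col in ("Crash Number", "Sex"):
        rows = one_hot(rows, col)
    print(sorted(rows[0]))  # column names after encoding
```

Dropping identifier-like columns before encoding keeps the feature matrix far smaller and removes columns that cannot generalize to new crashes.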
Selected features:
Index(['Total Non-Fatal Injuries', 'Person Number', 'Crash Number_12',
'Crash Time_361', 'Max Injury Severity Reported_2',
'Max Injury Severity Reported_5', 'Max Injury Severity Reported_7',
'Crash Hour_0', 'Driver Contributing Circumstances (All Drivers)_125',
'Driver Contributing Circumstances (All Drivers)_471',
...
'Crash Report IDs_3929', 'Crash Report IDs_3941',
'First Harmful Event Location_2', 'First Harmful Event Location_4',
'Most Harmful Event (All Vehicles)_132',
'Traffic Control Device Function_1',
'Traffic Control Device Function_3',
'Vehicle Sequence of Events (All Vehicles)_542', 'Vehicle Make_269',
'Age Group_0'],
dtype='object', length=7410)
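Several of the selected features above derive from columns that either describe the crash outcome (`Total Non-Fatal Injuries`, `Max Injury Severity Reported`) or identify a single crash (`Crash Number`, `Crash Report IDs`). Features like these can let a classifier recover severity almost directly, which may inflate the reported scores. The sketch below is one hypothetical way to flag such columns before training; the list of suspect source columns is an assumption based on the column names in this report, not part of the original analysis.

```python
# Source columns assumed to leak the target or to identify individual crashes.
SUSPECT_SOURCES = (
    "Total Non-Fatal Injuries",
    "Total Fatalities",
    "Max Injury Severity Reported",
    "Crash Number",
    "Crash Report IDs",
)


def split_suspect_features(feature_names):
    """Partition encoded feature names into (kept, flagged) lists.

    A name is flagged if it equals a suspect source column or is one of its
    one-hot columns, i.e. "<source>_<code>".
    """
    kept, flagged = [], []
    for name in feature_names:
        is_suspect = any(
            name == src or name.startswith(src + "_") for src in SUSPECT_SOURCES
        )
        (flagged if is_suspect else kept).append(name)
    return kept, flagged


if __name__ == "__main__":
    selected = [
        "Total Non-Fatal Injuries", "Person Number", "Crash Number_12",
        "Max Injury Severity Reported_2", "Crash Hour_0", "Age Group_0",
    ]
    kept, flagged = split_suspect_features(selected)
    print("kept:", kept)
    print("flagged:", flagged)
```

Retraining on the kept features only would show how much of the accuracy depends on the flagged ones.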
Shape of X_train: (19690, 7410)
Shape of X_test: (4923, 7410)
Shape of y_train: (19690,)
Shape of y_test: (4923,)
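The shapes above correspond to an 80/20 split of 24,613 rows. The report does not show the split call, so the following is a sketch under that assumption: it rounds the test size up, as scikit-learn's `train_test_split` does for a fractional `test_size`, and shuffles with a hypothetical seed of 42.

```python
import math
import random


def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into (train, test) lists."""
    n_test = math.ceil(n_rows * test_fraction)
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


if __name__ == "__main__":
    train_idx, test_idx = split_indices(24613)
    print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, so the four classifiers are compared on exactly the same held-out rows.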
KNeighborsClassifier()
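`KNeighborsClassifier()` with no arguments uses scikit-learn's default of five neighbors. The underlying rule is simple: find the k training points closest to the query and take a majority vote over their labels. The sketch below is a plain-Python illustration of that rule with Euclidean distance, not the library's implementation; `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy, already-standardized two-feature points with binary labels.
    X = [(-1.0, -0.5), (-0.8, -0.4), (-0.9, -0.6), (1.0, 0.9), (1.2, 1.1), (0.9, 1.0)]
    y = [0, 0, 0, 1, 1, 1]
    print(knn_predict(X, y, (1.1, 1.0), k=3))
    print(knn_predict(X, y, (-0.9, -0.5), k=3))
```

Because the vote depends on raw distances, the rule is sensitive to feature scale, which is one reason the numeric columns were standardized first.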
Classifier: LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')
Accuracy: 1.0
Precision: 1.0
Recall: 0.9990732159406858
F1-Score: 0.9995363931386184
Classifier: DecisionTreeClassifier()
Accuracy: 0.9999492127983748
Precision: 1.0
Recall: 0.9990732159406858
F1-Score: 0.9995363931386184
Classifier: RandomForestClassifier()
Accuracy: 1.0
Precision: 1.0
Recall: 0.9990732159406858
F1-Score: 0.9995363931386184
Classifier: KNeighborsClassifier()
Accuracy: 0.9997968511934993
Precision: 1.0
Recall: 0.9981464318813716
F1-Score: 0.9990723562152134
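The four scores above all derive from confusion-matrix counts. The sketch below computes them from scratch. The counts in the example are hypothetical: they were chosen so that precision, recall, and F1 reproduce the values reported for logistic regression (1,078 true positives, 1 false negative, no false positives), and the true-negative count is simply the remainder of a 4,923-row test set.

```python
def classification_scores(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = classification_scores(tp=1078, fp=0, fn=1, tn=3844)
    for name, value in scores.items():
        print(f"{name}: {value:.6f}")
```

A recall below 1 implies at least one false negative, so accuracy computed on the same predictions is also slightly below 1 (here 4922/4923).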
Logistic Regression Accuracy: 1.0
Decision Tree Accuracy: 1.0
Random Forest Accuracy: 1.0
KNN Accuracy: 1.0